INFO: Encoding special characters

Applies To:
eWebEditPro
Summary:

The HTML specification defines special characters for a set of punctuation symbols, accented letters, and a variety of non-Latin characters. Since the HTML specification has changed over time, so has the support for special characters in the browsers. For instance, Microsoft has defined a number of special characters that would (in the past) only display in Internet Explorer on Windows. They are extended characters that map to binary values 128 to 159. Depending on the version of your browser and operating system, and whether it was made by Microsoft, the characters may appear as expected or as a "?" or small rectangle. The W3C has now adopted most of these extended characters in HTML 4, but they are mapped to different binary values.

Choosing the wrong font face can also prevent the character from displaying. This is a common problem when copying from Microsoft Word, where many of the special characters are in the Symbol font. If the Symbol font is not available in the browser or not permitted in the editor, the character will display as some other character.

For example, the Euro symbol was designed for the European Economic Community (EEC) in the late 1990s. Operating systems and browsers created before then could not display it.

Euro character (shown using an image): [image: Euro]
Euro in Verdana font (display depends on your browser): €
Euro in Courier New font (display depends on your browser): €
Entity name: &euro;
Microsoft Windows extended character reference: &#128;
HTML 4 character reference: &#8364;
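The relationship between the three references can be checked with a short Python sketch. It uses only the standard library and is an illustration, not part of eWebEditPro. Note that `html.unescape` follows the HTML5 parsing rule that maps the legacy reference &#128; to the Euro sign, which is similar to how current browsers treat the Windows extended range.

```python
import html

EURO = "\u20ac"  # the Euro sign as a Unicode character

# In windows-1252 the Euro sign is the single byte 128 (hex 80).
windows_byte = EURO.encode("cp1252")

# In Unicode, and therefore in HTML 4, its code point is 8364.
code_point = ord(EURO)

# Decode the entity name, the Windows reference, and the HTML 4 reference.
decoded = [html.unescape(ref) for ref in ("&euro;", "&#128;", "&#8364;")]

print(windows_byte, code_point, decoded)
```

All three references decode to the same character here, but only because the decoder applies the windows-1252 mapping to &#128;. A strict HTML 4 parser is not required to do so.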

Characters with binary values 160 to 255 are also special characters in that they display differently depending on the language (or locale) of the browser and the charset attribute in the meta tag on the web page.

For example,

<meta http-equiv=Content-Type content="text/html; charset=iso-8859-2">

The way characters are displayed can even be controlled from the browser. For example, in IE 5, from the menu bar, select View > Encoding > language of your choice. (You may need to install the IE option for international language support.) In Netscape 4.7, select View > Character Set > language. The possible languages are grouped as West European (Latin1), East European (Latin2), Cyrillic, Arabic, Greek, Hebrew, and more. Each of these character sets is defined by ISO 8859.

The ISO 8859 special characters are listed below. Change the encoding of your browser to see the different ways the characters will be displayed.

¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
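The effect of changing the encoding can also be reproduced outside a browser. The following Python sketch decodes one byte from the 160 to 255 range under several ISO 8859 character sets. The choice of byte 230 (hex E6) and of the four charsets is arbitrary and made only for illustration.

```python
# Byte 230 (hex E6) lies in the 160-255 range of special characters.
raw = bytes([0xE6])

charsets = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7"]

# Decode the same byte under each charset, as a browser does when the
# charset in the meta tag or the browser's encoding setting changes.
rendered = {name: raw.decode(name) for name in charsets}

for name, char in rendered.items():
    print(f"{name}: {char} (U+{ord(char):04X})")
```

The same stored byte is shown as a West European, East European, Cyrillic, or Greek letter depending only on the charset used to interpret it.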

In summary, the following factors affect how a special character is displayed.

  • Browser (Internet Explorer, Netscape, etc.)
  • Version of the browser (3.0, 4.0, 5.0, etc.)
  • Operating System (Windows 95, NT, 2000, Linux, Mac, etc.)
  • Language of the O/S (English, Polish, Arabic, etc.)
  • Font (Times, Arial, Helvetica, Symbol, etc.)
  • Charset attribute in the meta tag (windows-1252, iso-8859-1, etc.)
  • Encoding/Character Set setting of the browser (Western, Central European, UTF-8, etc.)

Many Asian languages, such as Japanese, Korean, and Chinese, are represented by two bytes instead of just one. The binary values for these characters are in the range 256 to 65535. These are mapped as Unicode characters. eWebEditPro can optionally convert these characters to their character reference or leave them as double-byte binary Unicode values (which can be converted to UTF-8). For example, a character whose binary value is 1234 will be converted to "&#1234;".
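The conversion to character references can be sketched in a few lines of Python. The function name `to_char_refs` is hypothetical, and the code only illustrates the rule described above; it is not eWebEditPro's own implementation.

```python
def to_char_refs(text: str) -> str:
    """Replace every character above 255 with a decimal character reference."""
    return "".join(f"&#{ord(ch)};" if ord(ch) > 255 else ch for ch in text)


# One Latin-1 letter, the character with value 1234, and two Japanese characters.
sample = "caf\u00e9 " + chr(1234) + " \u65e5\u672c"
converted = to_char_refs(sample)
print(converted)
```

Characters with values up to 255 pass through unchanged, so the result still depends on the document's charset for those characters.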

eWebEditPro can be configured to represent extended and special characters in a number of different ways. They are:

  1. Extended characters, special characters, and double-byte characters as binary (Unicode, which can be converted to UTF-8).
  2. Extended and special characters as their entity name; double-byte characters as their character reference.
  3. Extended characters, special characters, and double-byte characters as their character reference.
  4. Extended characters as their entity name; special characters as binary; double-byte characters as their character reference.
  5. Extended characters as HTML 4 character references; special characters as binary; double-byte characters as their character reference.

charencode Attribute

To configure eWebEditPro, set the charencode attribute of the clean tag in the config.xml file.

For example,

<!-- values for charencode: utf-8, binary, entityname, charref, special, latin -->
<clean enabled="true" charencode="charref" ...>
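When deploying a configuration, it may help to check the attribute before use. The Python sketch below reads charencode from an XML fragment and validates it against the values listed in the comment above. The `<config>` wrapper element and the function name `read_charencode` are assumptions made for the example; a real config.xml contains more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Values listed in the comment that accompanies the clean tag.
ALLOWED = {"utf-8", "binary", "entityname", "charref", "special", "latin"}

# Hypothetical fragment; the <config> wrapper is assumed for illustration.
FRAGMENT = '<config><clean enabled="true" charencode="charref" /></config>'


def read_charencode(xml_text: str) -> str:
    """Return the charencode value of the first clean element, validating it."""
    clean = ET.fromstring(xml_text).find(".//clean")
    if clean is None:
        raise ValueError("no clean element found")
    value = clean.get("charencode", "")
    if value not in ALLOWED:
        raise ValueError(f"unexpected charencode value: {value!r}")
    return value


print(read_charencode(FRAGMENT))
```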

The values for charencode and their effects are shown in the following table.

Value of charencode Description Sampler
1. utf-8 or binary

[Sampler image: Character set (Latin 1)]

The sampler shows all the characters with binary values 128 to 255.

Characters 128-159 are extended characters. They are listed in two rows that start with 80, which is the hexadecimal representation of 128, and 90.

Characters 160-255 are special characters. They are listed in several rows that start with A0, which is the hexadecimal representation of 160, through F0.

The sampler was displayed using IE 5.0 on English language Windows (Latin1).

Double-byte characters are not shown, but would be stored as their binary values. In View as HTML, they will always appear as their character references. When viewed in a browser, they will display as characters only if the browser and operating system support that language.

WARNING: These characters will not display properly unless the operating system supports them. Even if they display in WYSIWYG mode, they will display as character references in View As HTML mode. If stored in a database, the database must support double-byte Unicode or UTF-8 characters. This value may not be supported in Netscape Navigator 4.

2. entityname

[Sampler image: Character set (Latin 1) (charencode=entityname)]

Extended characters are represented using their entity name (e.g., &euro;) where possible.

Special characters are represented using their entity name (e.g., &nbsp; or &Agrave;).

Double-byte characters are not shown, but would be represented as their character references.

3. charref

[Sampler image: Character set (Latin 1) (charencode=charref)]

Extended characters are represented using their HTML 4 character reference (e.g., &#8364;).

Special characters are represented using their character reference (e.g., &#160; or &#192;).

Double-byte characters are not shown, but would be represented as their character references.

4. special

[Sampler image: Character set (Latin 1) (charencode=special)]

Extended characters are represented using their entity name (e.g., &euro;) where possible.

Special characters remain as binary, except the non-breaking space, which is represented as &nbsp;.

Double-byte characters are not shown, but would be represented as their character references.

5. latin

[Sampler image: Character set (Latin 1) (charencode=latin)]

Extended characters are represented using their HTML 4 character reference (e.g., &#8364;).

Special characters remain as binary, except the non-breaking space, which is represented as &#160;.

Double-byte characters are not shown, but would be represented as their character references.
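The five behaviors described in the table can be compared side by side with a Python sketch. This is an approximation written from the descriptions above, not eWebEditPro's code. The function names are hypothetical, "extended" is assumed to mean the characters that windows-1252 stores in bytes 128 to 159, and entity names are taken from Python's HTML 4 entity table.

```python
from html.entities import codepoint2name

MODES = {"utf-8", "binary", "entityname", "charref", "special", "latin"}

# Characters that windows-1252 stores in bytes 128-159 ("extended" characters).
# Undefined bytes decode to an empty string, which is discarded.
EXTENDED = {
    bytes([b]).decode("cp1252", errors="ignore") for b in range(128, 160)
} - {""}


def _name_or_ref(ch: str) -> str:
    """Entity name where HTML 4 defines one, otherwise a character reference."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else f"&#{ord(ch)};"


def encode(text: str, charencode: str) -> str:
    """Approximate the output of each charencode value for the given text."""
    if charencode not in MODES:
        raise ValueError(f"unknown charencode value: {charencode!r}")
    out = []
    for ch in text:
        cp = ord(ch)
        ref = f"&#{cp};"
        if cp < 128 or charencode in ("binary", "utf-8"):
            out.append(ch)  # ASCII, or everything left as binary
        elif charencode == "charref":
            out.append(ref)
        elif ch in EXTENDED:
            out.append(ref if charencode == "latin" else _name_or_ref(ch))
        elif 160 <= cp <= 255:  # special characters
            if charencode == "entityname":
                out.append(_name_or_ref(ch))
            elif cp == 160:  # non-breaking space is never left as binary
                out.append("&nbsp;" if charencode == "special" else ref)
            else:
                out.append(ch)
        else:
            out.append(ref)  # double-byte characters
    return "".join(out)


# Euro (extended), non-breaking space and A-grave (special), value 1234 (double-byte).
sample = "\u20ac\u00a0\u00c0" + chr(1234)
for mode in ("binary", "entityname", "charref", "special", "latin"):
    print(mode, encode(sample, mode))
```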

Choosing a Value

The best charencode value depends on the environment in which the content will be viewed and on your preference for entity names versus character references. If the environment (for example, a database) supports only 7-bit ASCII characters, then either entityname or charref must be used. Values of special or latin produce smaller content because each special character requires one byte instead of six or more. A value of binary produces the smallest content when it consists mostly of Asian characters (e.g., Japanese, Korean, Chinese) because each character requires just two bytes instead of seven or more. Some sites convert Unicode characters to the UTF-8 byte stream format. If your site consistently uses UTF-8, use a value of utf-8.
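The size differences can be measured directly. The Python sketch below counts the bytes needed for one special character and one double-byte character in each representation. The sample characters and the helper name `size` are chosen only for illustration, and two-byte binary storage is assumed to mean UTF-16 without a byte order mark.

```python
def size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


e_acute = "\u00e9"  # a special character (binary value 233)
kanji = "\u65e5"    # a double-byte character (binary value 26085)

sizes = {
    "special as binary (latin-1)": size(e_acute, "latin-1"),
    "special as charref": size("&#233;", "ascii"),
    "special as entityname": size("&eacute;", "ascii"),
    "double-byte as binary (utf-16-le)": size(kanji, "utf-16-le"),
    "double-byte as utf-8": size(kanji, "utf-8"),
    "double-byte as charref": size("&#26085;", "ascii"),
}

for label, count in sizes.items():
    print(f"{label}: {count} bytes")
```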

The following table lists recommended charencode values given certain conditions.

Condition: Database supports only 7-bit characters.
Recommended charencode value: entityname or charref
Comments: Extended and special characters will be corrupted if the wrong charencode is selected. Choose between entityname or charref depending on your preference for entity names or character references.

Condition: Database supports only 8-bit characters.
Recommended charencode value: any except binary; use utf-8 only if your site uses UTF-8 consistently
Comments: Some special and all double-byte characters will be corrupted. If you use UTF-8, you must use it consistently on your site.

Condition: Double-byte encoding, typically for an Asian language, and document size is important.
Recommended charencode value: binary or utf-8
Comments: Database must support Unicode (double-byte) characters. Note: Unicode is not the same as UTF-8. If you use UTF-8, you must use it consistently on your site.

Condition: Entity names are always preferred.
Recommended charencode value: entityname
Comments: Extended and special characters will be their entity name.

Condition: Entity names are preferred, but in a non-Western European language.
Recommended charencode value: special
Comments: Special characters will be binary for different document encodings, but extended characters will be their entity name.

Condition: ISO-8859 (Latin) or windows charset encoding on document, but not Latin1 (that is, not windows-1252 or iso-8859-1).
Recommended charencode value: latin or special
Comments: Choose between latin or special depending on your preference for entity names or character references for extended characters.

Condition: Netscape Navigator 4 used for browsing.
Recommended charencode value: charref
Comments: Most extended and special characters will appear. Double-byte characters do not appear if the browser or the operating system does not support the language. If another charencode is selected, some extended and special characters may appear as a "?" or their entity name.

Condition: UTF-8 charset encoding on document.
Recommended charencode value: entityname or charref; use utf-8 only if your site uses UTF-8 consistently
Comments: Special and double-byte characters will not display correctly as binary. Choose between entityname or charref depending on your preference for entity names or character references. If you use UTF-8, you must use it consistently on your site.

Condition: XML without XHTML DTD/Schema.
Recommended charencode value: charref; use utf-8 only if your site uses UTF-8 consistently
Comments: XML only supports a very limited set of entity names unless the XHTML (or other) DTD is provided. If you use UTF-8, you must use it consistently on your site.

Condition: Not sure.
Recommended charencode value: charref
Comments: charref works with both UTF-8 encoding and XML parsers. It also gives the best results in Netscape. If special characters always appear as West European letters instead of the proper language, try latin.
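One reason charref is a safe default is that its output contains only 7-bit ASCII, so it reads the same under any document charset. The Python sketch below illustrates this with the standard "xmlcharrefreplace" error handler, which produces decimal character references. It stands in for the charref setting and is not eWebEditPro's own conversion.

```python
import html

# Euro (extended), e-acute (special), and the character with value 1234.
text = "\u20ac caf\u00e9 " + chr(1234)

# Turn every non-ASCII character into a decimal character reference.
stored = text.encode("ascii", errors="xmlcharrefreplace")

# Pure ASCII bytes decode identically under every charset tried here.
charsets = ("utf-8", "iso-8859-1", "iso-8859-2", "cp1252")
views = {cs: stored.decode(cs) for cs in charsets}

# Resolving the references restores the original text.
restored = html.unescape(views["utf-8"])

print(stored)
print(restored == text)
```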
More Resources:

How to produce UTF-8 (Ektron Knowledge Base Article)

Character entity references in HTML 4

http://www.w3.org/TR/REC-html40/sgml/entities.html

The ISO 8859 Alphabet Soup

http://wwwwbs.cs.tu-berlin.de/user/czyborra/charsets/

Dan's Web Tips: Characters and Fonts

http://webtips.dan.info/char.html